Introduction¶
This notebook explores the impact of storm events in California from 2000 through 2024. It uses NOAA’s Storm Events Database to analyze event frequency and severity, including injuries, fatalities, and economic damage. The dataset was filtered to include only events with reported consequences and adjusted for inflation to reflect 2024 dollars. The analysis focuses on trends, high-impact events, and patterns across time and event types.
Methods¶
The following data processing steps were taken outside of this notebook:
- Data was downloaded directly from the NOAA Storm Events Database in a compressed format: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/
- Only "details" files were used, which include key event attributes such as type, timing, location, injuries, deaths, and damage estimates.
- After extraction, the data was filtered to include only rows where at least one form of impact (injuries, deaths, or damage) was reported.
- Only events in California were retained.
- Relevant columns were kept, and the data was saved to a single compressed `.parquet` file for efficient loading and processing in this notebook.
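The steps above can be sketched as follows. This is a hypothetical reconstruction of the external preprocessing, not the exact script that was run: the column names match the NOAA "details" files, but the impact filter is simplified (it treats any non-"0" damage string as reported damage) and the input here is a synthetic frame.

```python
import pandas as pd

# Columns retained from the NOAA "details" files (matches the notebook's data)
keep_cols = ["STATE", "EVENT_TYPE", "CZ_NAME", "BEGIN_DATE_TIME", "END_DATE_TIME",
             "INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT",
             "DEATHS_INDIRECT", "DAMAGE_PROPERTY", "DAMAGE_CROPS", "EVENT_NARRATIVE"]

def filter_details(details: pd.DataFrame) -> pd.DataFrame:
    """Keep California events with at least one reported impact."""
    casualty_cols = ["INJURIES_DIRECT", "INJURIES_INDIRECT",
                     "DEATHS_DIRECT", "DEATHS_INDIRECT"]
    # Any injury or death counts as an impact
    has_casualties = details[casualty_cols].fillna(0).astype(int).sum(axis=1) > 0
    # Simplified damage check: any damage string other than "0" counts
    has_damage = (details["DAMAGE_PROPERTY"].fillna("0").ne("0")
                  | details["DAMAGE_CROPS"].fillna("0").ne("0"))
    in_ca = details["STATE"] == "CALIFORNIA"
    return details.loc[in_ca & (has_casualties | has_damage), keep_cols]

# Usage sketch (paths are illustrative):
# details = pd.read_csv(path_to_details_csv_gz)
# filter_details(details).to_parquet("cleaned_storm_data_ca.parquet", compression="snappy")
```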
Notes on Data Quality¶
- The NOAA Storm Events dataset, while incredibly useful, has known quality issues. A 2016 paper by Renato P. dos Santos (Some comments on the reliability of NOAA’s Storm Events Database) found that:
- Over half the records lacked damage estimates.
- Many entries contained errors or used non-standard event types.
- Narrative fields were inconsistently populated and varied in quality.
- Since 2016, some work has been done on the data, including standardizing the EVENT_TYPE variable.
- This analysis likely underestimates total damage:
- Missing values in damage columns were filled with zeroes.
- Events without any recorded injuries, deaths, or economic damage were filtered out during preprocessing.
- As a result, this dataset should be treated as incomplete rather than comprehensive. It's more suitable for exploratory visualizations than precise economic or policy conclusions.
Additional Information¶
- Monetary figures were adjusted for inflation to 2024 dollars using CPI data from the U.S. Federal Reserve: https://fred.stlouisfed.org/series/CPIAUCSL
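The adjustment multiplies each nominal dollar amount by the ratio of the 2024 CPI to the event year's CPI. A minimal sketch, using approximate (not exact) annual CPIAUCSL values; the notebook itself loads the real series from the FRED CSV:

```python
# Approximate annual CPIAUCSL values for illustration only
cpi = {2000: 172.2, 2024: 313.7}

def to_2024_dollars(amount: float, year: int) -> float:
    # Scale a nominal amount by the ratio of 2024 CPI to event-year CPI
    return amount * cpi[2024] / cpi[year]

adjusted = to_2024_dollars(30_000, 2000)  # roughly 1.8x the nominal value
```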
Load and Check Data¶
import os
import re
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from IPython.display import display
# Set default plotly theme
pio.templates.default = "plotly_white"
# Set to "notebook" to reduce file size by factor of 4
pio.renderers.default = "notebook"
# Format pandas display for floats
pd.set_option('display.float_format', '{:,.0f}'.format)
# Set the path to the cleaned Parquet file
base_dir = os.path.dirname(os.path.abspath("__file__"))  # Resolves to the notebook's working directory
data_path = os.path.join(base_dir, "data", "cleaned_storm_data_ca.parquet")
# Load the data
df_raw = pd.read_parquet(data_path)
# Quick check
print(f"Loaded {len(df_raw):,} rows")
df_raw.head()
Loaded 6,974 rows
| STATE | EVENT_TYPE | CZ_NAME | BEGIN_DATE_TIME | END_DATE_TIME | INJURIES_DIRECT | INJURIES_INDIRECT | DEATHS_DIRECT | DEATHS_INDIRECT | DAMAGE_PROPERTY | DAMAGE_CROPS | EVENT_NARRATIVE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CALIFORNIA | Wildfire | SAN BERNARDINO | 06-JAN-00 04:26:00 | 06-JAN-00 10:00:00 | 1 | 0 | 0 | 0 | 30K | 0 | 0 |
| 1 | CALIFORNIA | Dense Fog | W RIVERSIDE T X NW | 12-JAN-00 05:40:00 | 12-JAN-00 08:30:00 | 1 | 0 | 0 | 0 | 25K | 0 | 0 |
| 2 | CALIFORNIA | Heavy Rain | SACRAMENTO | 22-JAN-00 18:00:00 | 24-JAN-00 23:00:00 | 1 | 0 | 0 | 0 | 15K | 0 | 0 |
| 3 | CALIFORNIA | High Wind | X SW SAN BERNARDINO | 05-JAN-00 19:00:00 | 06-JAN-00 13:30:00 | 1 | 0 | 0 | 0 | 200K | 0 | 0 |
| 4 | CALIFORNIA | High Wind | W RIVERSIDE T X NW | 05-JAN-00 19:00:00 | 06-JAN-00 13:50:00 | 1 | 0 | 0 | 0 | 200K | 0 | 0 |
# Look at data characteristics
df_raw.describe(include='all')
| STATE | EVENT_TYPE | CZ_NAME | BEGIN_DATE_TIME | END_DATE_TIME | INJURIES_DIRECT | INJURIES_INDIRECT | DEATHS_DIRECT | DEATHS_INDIRECT | DAMAGE_PROPERTY | DAMAGE_CROPS | EVENT_NARRATIVE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 | 6974 |
| unique | 1 | 33 | 253 | 6007 | 5810 | 44 | 21 | 14 | 6 | 401 | 130 | 5050 |
| top | CALIFORNIA | Strong Wind | SAN BERNARDINO | 01-JAN-06 00:00:00 | 31-DEC-05 23:59:00 | 0 | 0 | 0 | 0 | 1.00K | 0.00K | 0 |
| freq | 6974 | 1618 | 803 | 16 | 14 | 6339 | 6820 | 6478 | 6826 | 882 | 4812 | 1863 |
# Check for missing values
df_raw.isnull().sum().sum()
0
Transform Data¶
The columns in the .parquet file were stored as strings, so they need to be converted to proper data types before analysis.
In this section, we transform the data in the following ways:
- Convert date and time columns to datetime format
- Convert any numeric fields to numeric format
- Convert the dollar-amount damage values from strings (sometimes with the letters K, M, or B) to numeric format
- Fill missing damage amounts with 0
- Adjust the damage column for inflation
- Sum up the direct and indirect injuries, direct and indirect deaths, and property and crop damage into respective totals
- Drop unused columns
# Combine both damage columns and find non-digit/non-dot characters
all_damage = pd.concat([df_raw['DAMAGE_PROPERTY'], df_raw['DAMAGE_CROPS']]).dropna()
non_numeric_chars = set()
for entry in all_damage:
    non_numeric_chars.update(re.findall(r'[^\d\.]', str(entry)))
print(non_numeric_chars)
{'K', 'B', 'M'}
# Create a deep copy
df = df_raw.copy(deep=True)
# Transform time fields to datetime
df['BEGIN_DATE_TIME'] = pd.to_datetime(df['BEGIN_DATE_TIME'], format='%d-%b-%y %H:%M:%S', errors='coerce')
df['END_DATE_TIME'] = pd.to_datetime(df['END_DATE_TIME'], format='%d-%b-%y %H:%M:%S', errors='coerce')
for col in ['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
def parse_damage(value):
    '''
    Parse damage strings, which may carry a "K", "M", or "B" scale suffix.
    Returns a float in dollars, or None if the value cannot be parsed.
    '''
    if pd.isnull(value) or value in ['0', '0.00K', '0.00M', '0.00B']:
        return 0.0
    try:
        num = float(value[:-1])
        scale = value[-1].upper()
        if scale == 'K':
            return num * 1_000
        elif scale == 'M':
            return num * 1_000_000
        elif scale == 'B':
            return num * 1_000_000_000
        else:
            return float(value)  # fallback: no scale suffix
    except (ValueError, TypeError):
        return None
# Transform damage columns from string to numeric
df['DAMAGE_PROPERTY'] = df['DAMAGE_PROPERTY'].apply(parse_damage)
df['DAMAGE_CROPS'] = df['DAMAGE_CROPS'].apply(parse_damage)
# Fill missing damage with zero
df['DAMAGE_PROPERTY'] = df['DAMAGE_PROPERTY'].fillna(0)
df['DAMAGE_CROPS'] = df['DAMAGE_CROPS'].fillna(0)
# Check time columns for issues
print(df['BEGIN_DATE_TIME'].isnull().sum(), "invalid BEGIN_DATE_TIME entries")
print(df['END_DATE_TIME'].isnull().sum(), "invalid END_DATE_TIME entries")
# Check numeric columns
print(df[['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'DAMAGE_PROPERTY', 'DAMAGE_CROPS']].dtypes)
print(df[['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT']].isnull().sum().sum())
print(df[['DAMAGE_PROPERTY', 'DAMAGE_CROPS']].isnull().sum())
0 invalid BEGIN_DATE_TIME entries
0 invalid END_DATE_TIME entries
INJURIES_DIRECT        int64
INJURIES_INDIRECT      int64
DEATHS_DIRECT          int64
DEATHS_INDIRECT        int64
DAMAGE_PROPERTY      float64
DAMAGE_CROPS         float64
dtype: object
0
DAMAGE_PROPERTY    0
DAMAGE_CROPS       0
dtype: int64
# Adjust for inflation
# Load CPI data
cpi_df = pd.read_csv("data/CPIAUCSL_yearly.csv")
cpi_df['YEAR'] = pd.to_datetime(cpi_df['observation_date']).dt.year
cpi_dict = dict(zip(cpi_df['YEAR'], cpi_df['CPIAUCSL']))
# Define base CPI
base_cpi = cpi_dict[2024]
# Add YEAR and MONTH columns
df['YEAR'] = df['BEGIN_DATE_TIME'].dt.year
df['MONTH'] = df['BEGIN_DATE_TIME'].dt.month
# Compute inflation adjustment factor for each row
df['CPI_FACTOR'] = df['YEAR'].map(lambda y: base_cpi / cpi_dict.get(y, base_cpi))
# Adjust damage values
df['DAMAGE_PROPERTY_ADJ'] = df['DAMAGE_PROPERTY'] * df['CPI_FACTOR']
df['DAMAGE_CROPS_ADJ'] = df['DAMAGE_CROPS'] * df['CPI_FACTOR']
# Create additional columns
df['TOTAL_INJURIES'] = df['INJURIES_DIRECT'] + df['INJURIES_INDIRECT']
df['TOTAL_DEATHS'] = df['DEATHS_DIRECT'] + df['DEATHS_INDIRECT']
df['TOTAL_DAMAGE_ADJ'] = df['DAMAGE_PROPERTY_ADJ'] + df['DAMAGE_CROPS_ADJ']
# Drop unused columns
df = df.drop(columns=["STATE", "INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT", "DEATHS_INDIRECT",
"DAMAGE_PROPERTY", "DAMAGE_CROPS", "CPI_FACTOR",
"DAMAGE_PROPERTY_ADJ", "DAMAGE_CROPS_ADJ"])
# Check the new data frame
df
| EVENT_TYPE | CZ_NAME | BEGIN_DATE_TIME | END_DATE_TIME | EVENT_NARRATIVE | YEAR | MONTH | TOTAL_INJURIES | TOTAL_DEATHS | TOTAL_DAMAGE_ADJ | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wildfire | SAN BERNARDINO | 2000-01-06 04:26:00 | 2000-01-06 10:00:00 | 0 | 2000 | 1 | 1 | 0 | 54,654 |
| 1 | Dense Fog | W RIVERSIDE T X NW | 2000-01-12 05:40:00 | 2000-01-12 08:30:00 | 0 | 2000 | 1 | 1 | 0 | 45,545 |
| 2 | Heavy Rain | SACRAMENTO | 2000-01-22 18:00:00 | 2000-01-24 23:00:00 | 0 | 2000 | 1 | 1 | 0 | 27,327 |
| 3 | High Wind | X SW SAN BERNARDINO | 2000-01-05 19:00:00 | 2000-01-06 13:30:00 | 0 | 2000 | 1 | 1 | 0 | 364,358 |
| 4 | High Wind | W RIVERSIDE T X NW | 2000-01-05 19:00:00 | 2000-01-06 13:50:00 | 0 | 2000 | 1 | 1 | 0 | 364,358 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6969 | Wildfire | SAN BERNARDINO AND RIVERSIDE COUNTY VALLEYS - ... | 2024-07-21 00:00:00 | 2024-07-25 00:00:00 | Hawarden wildfire started 130 pm on July 21, 2... | 2024 | 7 | 0 | 0 | 10,000,000 |
| 6970 | Wildfire | SAN DIEGO COUNTY MOUNTAINS | 2024-07-24 13:00:00 | 2024-07-26 00:00:00 | Grove wildfire started around 1 pm PDT on July... | 2024 | 7 | 0 | 0 | 30,000 |
| 6971 | Thunderstorm Wind | SAN DIEGO | 2024-07-24 12:00:00 | 2024-07-24 15:00:00 | East Warners SDGE site had a gust to 50 mph at... | 2024 | 7 | 0 | 0 | 10,000 |
| 6972 | Thunderstorm Wind | SAN DIEGO | 2024-07-24 15:00:00 | 2024-07-24 17:00:00 | Pine Valley Horse mesonet reported gusts 53 mp... | 2024 | 7 | 0 | 0 | 10,000 |
| 6973 | Thunderstorm Wind | SAN DIEGO | 2024-07-24 14:00:00 | 2024-07-24 16:00:00 | Shockey Truck Trail SGDE measured wind gust 51... | 2024 | 7 | 0 | 0 | 10,000 |
6974 rows × 10 columns
# Define impact type color map
impact_colors = {
"damage": "firebrick",
"injuries": "royalblue",
"deaths": "seagreen"
}
Yearly totals over time¶
Let's plot the total damage, injuries, and fatalities by year over time.
# Aggregate yearly totals
yearly = df.groupby('YEAR').agg({
'TOTAL_DAMAGE_ADJ': 'sum',
'TOTAL_INJURIES': 'sum',
'TOTAL_DEATHS': 'sum'
}).reset_index()
# Create subplots
fig = make_subplots(rows=1, cols=3, subplot_titles=["Total Damage", "Total Injuries", "Total Deaths"])
# Plot total damage
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_DAMAGE_ADJ'],
mode='lines+markers', name='Damage ($)', line=dict(color=impact_colors["damage"])),
row=1, col=1)
# Plot total injuries
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_INJURIES'],
mode='lines+markers', name='Injuries', line=dict(color=impact_colors["injuries"])),
row=1, col=2)
# Plot total deaths
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_DEATHS'],
mode='lines+markers', name='Deaths', line=dict(color=impact_colors["deaths"])),
row=1, col=3)
fig.update_layout(title_text="Yearly Totals: Damage, Injuries, and Deaths in California (2000-2024)",
height=400, width=1200, showlegend=False)
fig.show()
Top 10 event types by impact type (damage, injuries, and deaths)¶
# Aggregate by event type
event_summary = df.groupby('EVENT_TYPE').agg({
'TOTAL_DAMAGE_ADJ': 'sum',
'TOTAL_INJURIES': 'sum',
'TOTAL_DEATHS': 'sum'
}).reset_index()
# Get top 10 for each category
top_damage = event_summary.nlargest(10, 'TOTAL_DAMAGE_ADJ')
top_injuries = event_summary.nlargest(10, 'TOTAL_INJURIES')
top_deaths = event_summary.nlargest(10, 'TOTAL_DEATHS')
# Create subplots
fig = make_subplots(
rows=1, cols=3,
subplot_titles=["By Damage (log scale)", "By Injuries (log scale)", "By Deaths"],
horizontal_spacing=0.12)
# Bar for damage
fig.add_trace(go.Bar(
x=top_damage['EVENT_TYPE'],
y=top_damage['TOTAL_DAMAGE_ADJ'],
name='Damage',
marker_color=impact_colors['damage']
), row=1, col=1)
# Bar for injuries
fig.add_trace(go.Bar(
x=top_injuries['EVENT_TYPE'],
y=top_injuries['TOTAL_INJURIES'],
name='Injuries',
marker_color=impact_colors['injuries']
), row=1, col=2)
# Bar for deaths
fig.add_trace(go.Bar(
x=top_deaths['EVENT_TYPE'],
y=top_deaths['TOTAL_DEATHS'],
name='Deaths',
marker_color=impact_colors['deaths']
), row=1, col=3)
fig.update_layout(
title_text="Top 10 Event Types by Damage, Injuries, and Deaths in California (2000-2024)",
height=400,
width=1200,
showlegend=False)
# Set y to log scale for Damage and Injuries
fig.update_yaxes(type='log', row=1, col=1)
fig.update_yaxes(type='log', row=1, col=2)
fig.show()
Event type by frequency over time¶
Let's plot how often the top 5 most frequent event types occur each year.
# Count total number of events per EVENT_TYPE
most_common_types = df['EVENT_TYPE'].value_counts().nlargest(5).index.tolist()
# Filter for those event types
filtered_df = df[df['EVENT_TYPE'].isin(most_common_types)].copy()
# Group and count
freq_by_year = filtered_df.groupby(['YEAR', 'EVENT_TYPE']).size().reset_index(name='COUNT')
# Plot
fig = px.line(
freq_by_year,
x='YEAR',
y='COUNT',
color='EVENT_TYPE',
title='Event Type Frequency Over Time (Top 5 Types), California (2000-2024)',
markers=True,
color_discrete_sequence=px.colors.qualitative.Dark2
)
fig.update_layout(height=400, width=1200, legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5))
fig.update_traces(opacity=0.9)
fig.show()
Bubble plot of damage vs fatalities by event type¶
# Group and summarize, including start date
bubble_data = df.groupby('EVENT_TYPE').agg({
'TOTAL_DAMAGE_ADJ': 'sum',
'TOTAL_DEATHS': 'sum',
'TOTAL_INJURIES': 'sum',
'BEGIN_DATE_TIME': 'min' # Earliest event date
}).reset_index()
bubble_data['BEGIN_DATE_TIME'] = bubble_data['BEGIN_DATE_TIME'].dt.strftime('%Y-%m-%d')
# Plot with labels
fig = px.scatter(
bubble_data,
x='TOTAL_DAMAGE_ADJ',
y='TOTAL_DEATHS',
size='TOTAL_INJURIES',
hover_name='EVENT_TYPE',
hover_data={'BEGIN_DATE_TIME': True},
text='EVENT_TYPE', # Show event names as labels
title='Damage vs. Fatalities by Event Type (Bubble Size = Injuries), California (2000–2024)',
labels={
'TOTAL_DAMAGE_ADJ': 'Total Damage (Adj $)',
'TOTAL_DEATHS': 'Total Deaths',
'TOTAL_INJURIES': 'Total Injuries',
'BEGIN_DATE_TIME': 'Start Date'
},
height=500,
width=1200,
color_discrete_sequence=['#1f77b4']
)
fig.update_traces(
marker=dict(opacity=0.6, line=dict(width=1, color='DarkSlateGrey')),
textposition='top center' # Position labels above bubbles
)
fig.update_layout(xaxis_type='log', yaxis_type='log')
fig.show()
High-impact events tables¶
Here we create three tables showing the top 5 most severe events by total damage, total deaths, and total injuries.
top_damage_events = (
df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_DAMAGE_ADJ', 'EVENT_NARRATIVE']]
.sort_values(by='TOTAL_DAMAGE_ADJ', ascending=False)
.head(5)
.copy()
)
top_death_events = (
df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_DEATHS', 'EVENT_NARRATIVE']]
.sort_values(by='TOTAL_DEATHS', ascending=False)
.head(5)
.copy()
)
top_injury_events = (
df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_INJURIES', 'EVENT_NARRATIVE']]
.sort_values(by='TOTAL_INJURIES', ascending=False)
.head(5)
.copy()
)
display(top_damage_events)
display(top_death_events)
display(top_injury_events)
| BEGIN_DATE_TIME | EVENT_TYPE | TOTAL_DAMAGE_ADJ | EVENT_NARRATIVE | |
|---|---|---|---|---|
| 4863 | 2018-11-08 06:33:00 | Wildfire | 21,238,016,726 | The Camp Fire began on the morning of November... |
| 3595 | 2014-12-01 00:00:00 | Drought | 1,987,820,797 | Impacts to the region included: Reservoirs dro... |
| 4920 | 2018-07-23 15:00:00 | Wildfire | 1,873,942,652 | The Carr Fire began on the afternoon of July 2... |
| 609 | 2003-10-25 17:37:00 | Wildfire | 1,784,157,375 | 0 |
| 608 | 2003-10-25 09:00:00 | Wildfire | 1,187,278,735 | 0 |
| BEGIN_DATE_TIME | EVENT_TYPE | TOTAL_DEATHS | EVENT_NARRATIVE | |
|---|---|---|---|---|
| 4863 | 2018-11-08 06:33:00 | Wildfire | 86 | The Camp Fire began on the morning of November... |
| 1622 | 2006-07-16 15:00:00 | Heat | 30 | Fresno/Madera/Merced |
| 4921 | 2018-01-09 04:00:00 | Debris Flow | 21 | Intense rainfall generated a large and deadly ... |
| 5430 | 2020-09-07 21:00:00 | Wildfire | 16 | Immediate evacuation orders where given to Ber... |
| 1624 | 2006-07-16 15:00:00 | Heat | 16 | Tulare |
| BEGIN_DATE_TIME | EVENT_TYPE | TOTAL_INJURIES | EVENT_NARRATIVE | |
|---|---|---|---|---|
| 4921 | 2018-01-09 04:00:00 | Debris Flow | 168 | Intense rainfall generated a large and deadly ... |
| 609 | 2003-10-25 17:37:00 | Wildfire | 90 | 0 |
| 5455 | 2020-09-05 08:00:00 | Excessive Heat | 77 | Parts of the area rose up to 110 to 115 degree... |
| 3791 | 2014-10-04 12:00:00 | Heat | 75 | One of the local television stations and San D... |
| 5393 | 2020-07-12 09:00:00 | Wildfire | 68 | One of the Navy's vessels in the San Diego Bay... |